Using StatsBomb Data in R

Introduction

In B1700 you started to learn the basics of R, and in the previous practical for B1701 you learned how to load multiple files at once. However, there are occasions when your data is not stored in flat files and you may want to pull data from an online database or website. How to do this without predefined R packages is beyond the aims of this course, but many R packages are available that can help you pull data from the web. Examples are worldfootballR, baseballr, hoopR, and SwimmeR. All of these packages come with instructions on how to use them to pull relevant data from a variety of sources, and it is worth having a look at some of them.

For this practical we will use the StatsBombR package. StatsBomb is a sports analytics company that specializes in providing data and insights related to football. The company focuses on collecting, analyzing, and delivering detailed statistical information about football matches and players. Most of their data is behind a paywall, but they do offer a range of datasets for free, and that is the data we will be exploring.

Installing and loading packages

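To start, install the devtools, R.utils, SDMTools, and StatsBombR packages using the following code. SDMTools is installed from a local source archive, so replace YOURFILELOCATION with the folder in which you saved the .tar.gz file.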

Show the code
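# Install packages
install.packages("devtools")   # provides install_github()
install.packages("R.utils")
# SDMTools is installed from a local source archive; replace YOURFILELOCATION with its folder
install.packages("YOURFILELOCATION/SDMTools_1.1-221.2.tar.gz", repos=NULL, type="source", dependencies=TRUE)
# StatsBombR is installed from GitHub
devtools::install_github("statsbomb/StatsBombR")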
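
Re-running install.packages() every time a script is sourced is slow. The following is a minimal sketch of one way to find out which CRAN packages still need installing. The helper missing_packages() is a hypothetical name introduced here for illustration and is not part of any package.

```r
# Hypothetical helper: return the names of packages that are not yet installed.
# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# CRAN packages used in this practical
to_install <- missing_packages(c("devtools", "R.utils", "tidyverse"))

# Uncomment to install whatever is missing
# if (length(to_install) > 0) install.packages(to_install)
```

SDMTools and StatsBombR are not covered by this check because, in this practical, they are installed from a local archive and from GitHub respectively.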

Next we need to load StatsBombR and tidyverse.

Show the code
# Load packages
library(StatsBombR)
library(tidyverse)

Loading your data

Once you have successfully installed and loaded all the necessary packages, you can begin reading your data.

StatsBombR uses several functions to load data into R:

  1. FreeCompetitions() shows all the competition data that is available for free.

  2. FreeMatches() shows the available matches within a competition.

  3. StatsBombFreeEvents() loads the event data for all specified matches. This practical uses free_allevents() for this step.

A useful guide about loading in StatsBomb data using R is provided here.

Loading competition data

Load all of the free competition data and assign it to a tibble named CompDF.

Show the code
CompDF <- FreeCompetitions()

Looking at the CompDF data frame, we see we have data available for 71 competitions. For this practical we are interested in the Men’s European Championship (MEC), so we need to create a data frame that contains only the information for that competition. We could scroll through the table to find the competition ID and filter on that ID, but with a large data set this can be difficult. Instead, we can filter on a few variables that we know relate to the MEC: competition_gender ("male"), season_name (2020), and competition_international (TRUE).

Show the code
MEC_DF <- CompDF %>%
  filter(competition_gender=="male" & season_name==2020 & competition_international==TRUE)
print(MEC_DF)
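
Before passing the result on to the next step, it is worth confirming that the filter returned exactly one competition. The sketch below shows the idea on a toy tibble. The rows and ID values are made up for illustration; only the column names mirror those used in the filter above.

```r
library(dplyr)
library(tibble)

# Toy stand-in for CompDF (values are illustrative, not real StatsBomb IDs)
CompDF_toy <- tibble(
  competition_id            = c(1, 2, 3),
  competition_name          = c("Toy Euro", "Toy League", "Toy Women's Cup"),
  competition_gender        = c("male", "male", "female"),
  season_name               = c("2020", "2020/2021", "2020"),
  competition_international = c(TRUE, FALSE, TRUE)
)

# Same conditions as the filter above; in this toy data season_name is text,
# so it is compared with the string "2020"
MEC_toy <- CompDF_toy %>%
  filter(competition_gender == "male",
         season_name == "2020",
         competition_international == TRUE)

# Stop with an error if the filter matched zero or several competitions
stopifnot(nrow(MEC_toy) == 1)
print(MEC_toy)
```

The same stopifnot() check could be applied to MEC_DF before calling FreeMatches().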

Loading match data

Now that we have created a separate table for just the MEC, we can use it to load all the match data using FreeMatches().

Show the code
MatchesDF <- FreeMatches(MEC_DF)

Tip

You do not need to create an intermediate variable if you are confident that the filter returns the correct competition. You could embed the filter into the FreeMatches() call using a pipeline as follows:

MatchesDF <- CompDF %>%
  filter(competition_gender=="male" & season_name==2020 & competition_international==TRUE) %>%
  FreeMatches()

Loading event data

We have loaded all the match information for the Men’s European Championship, but what we are really after is event data. Event data will give us the opportunity to analyse the performance of individual players and teams.

We will use free_allevents() to load in all event data for all the matches played during the MEC.

Show the code
ECDataDF <- free_allevents(MatchesDF)

The last step in loading StatsBomb data is to apply the allclean() function. This is not just a cleaning operation: the function also creates some additional variables which may come in useful later on (e.g. location data split into x and y coordinates).

Show the code
# as_tibble() replaces the deprecated as.tibble()
ECDataDF <- as_tibble(allclean(ECDataDF))
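
To illustrate what splitting location data into x and y coordinates means, here is a minimal sketch on a toy tibble. It is not StatsBombR's implementation of allclean(). The values are made up, and the column names type.name, location.x and location.y are assumptions chosen for this example.

```r
library(dplyr)
library(purrr)
library(tibble)

# Toy event data: each location is a length-2 numeric vector stored in a list column
events_toy <- tibble(
  type.name = c("Pass", "Shot"),
  location  = list(c(60, 40), c(108, 36))
)

# Pull the first and second element of every location into separate numeric columns
events_toy <- events_toy %>%
  mutate(location.x = map_dbl(location, 1),
         location.y = map_dbl(location, 2))

print(events_toy)
```

Once the coordinates sit in ordinary numeric columns they can be filtered, summarised, and plotted like any other variable.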

Once we have finished loading our data, we would like to save the data set as an RDS file.

Show the code
saveRDS(ECDataDF, file="C:/Users/wkb14101/OneDrive - University of Strathclyde/MSc SDA/R Projects/B1701/data/ECData.rds")

If we wanted to save the data as a .csv file instead, we would first have to remove the list-type columns, because write.csv() cannot write them. The select_if() function can be used to select columns from the ECDataDF data frame based on a condition. The condition here is is.list, which checks whether a column is of list type. The result is a new data frame, list_vars, containing only the columns that hold list-type values. As we are only interested in the column names, we use list_vars <- c(names(list_vars)) to store those names as a character vector. Finally, we compare list_vars with the column names of ECDataDF and keep only the columns that do not match (!) a name in list_vars; in other words, we remove the columns identified earlier as list-type. We can then save ECDataDF as a .csv file.
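
The column-dropping steps described above can be sketched on a toy tibble as follows. This is a reconstruction under assumptions rather than the original code for this step: the data, the object names ECData_toy and ECData_flat, and the temporary file path are all made up for illustration.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the event data (made-up values)
ECData_toy <- tibble(
  type.name = c("Pass", "Shot"),
  minute    = c(3L, 57L),
  location  = list(c(60, 40), c(108, 36))   # list-type column
)

# 1. Select only the columns that are lists
list_vars <- ECData_toy %>% select_if(is.list)

# 2. Keep just their names as a character vector
list_vars <- c(names(list_vars))

# 3. Keep the columns whose names do NOT appear in list_vars
ECData_flat <- ECData_toy[, !(names(ECData_toy) %in% list_vars)]

# 4. Save the flattened data as a .csv file (written to a temporary folder here)
csv_path <- file.path(tempdir(), "ECData_toy.csv")
write.csv(ECData_flat, csv_path, row.names = FALSE)
```

Note that the information held in the dropped list columns is lost in the .csv file.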

If we do not want to delete the list-structured variables, we can save the dataset as an RDS file, a format that stores a single R object, as we did above. The saveRDS() function is used in a similar way to write.csv(), and readRDS() opens an RDS file again. In summary, while CSV and Excel formats are easy to use outside of R, they are not good choices for preserving list variables; the RDS format is recommended for maintaining the integrity of data frames with list variables.
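
A short round trip shows that a list column survives saving and re-loading as RDS. This is a sketch using made-up data and a temporary file; in your own project you would use a path inside your data folder instead.

```r
# Toy data frame with one ordinary column and one list column (made-up values)
events_toy <- data.frame(type.name = c("Pass", "Shot"))
events_toy$location <- list(c(60, 40), c(108, 36))

# Save to a temporary .rds file and read it back in
rds_path <- file.path(tempdir(), "events_toy.rds")
saveRDS(events_toy, rds_path)
restored <- readRDS(rds_path)

# The restored object matches the original, list column included
print(identical(restored, events_toy))
print(is.list(restored$location))
```

Building the path with file.path() rather than typing a full absolute path also makes the script easier to move between computers.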

Exercises

Exercise 1: Make sure StatsBombR and tidyverse are installed and loaded.

Show the answer
# Load packages
library(StatsBombR)
library(tidyverse)

Exercise 2: Load all event data for the 2018 National Women’s Soccer League.

Show the answer
# List all free competitions
CompDF <- FreeCompetitions()

# Keep only the 2018 women's domestic competition
NWSLDF <- CompDF %>%
  filter(competition_gender=="female" & season_name==2018 & competition_international==FALSE)
print(NWSLDF)

# Load the matches, then the events, then clean the event data
MatchesDF <- FreeMatches(NWSLDF)
NWSLDataDF <- free_allevents(MatchesDF)
NWSLDataDF <- as_tibble(allclean(NWSLDataDF))

Exercise 3: Save your data file using saveRDS().

Show the answer
saveRDS(NWSLDataDF, "C:/Users/wkb14101/OneDrive - University of Strathclyde/MSc SDA/R Projects/B1701/data/NWSLData.rds")